
ipc: use shared memory for large events #972

Draft
matthew-levan wants to merge 46 commits into ml/64 from ml/shm

Conversation

@matthew-levan matthew-levan commented Feb 27, 2026

Shared Memory Plea Protocol

Large pokes (file system commits, etc.) sent from urth to mars previously
travelled over the Unix pipe using the standard newt/jam/cue path. For
payloads above ~256 MiB this caused severe memory pressure and, for payloads
approaching 2 GiB, a process segfault.

This PR describes the replacement: a POSIX shared-memory fast path that
bypasses the pipe for large events, copies the raw loom noun structure directly
(no jam/cue), and keeps peak memory within the capacity of a 16 GiB machine.


Problem

The standard path for sending an event from urth to mars is:

  1. Urth jams the noun → ~2 GiB C-heap buffer (dat_y)
  2. Urth writes dat_y over the pipe (5-byte-header newt framing)
  3. Mars reads the pipe, cues the bytes back into a loom noun

For a 2 GiB file commit this requires:

  • Urth: ~2 GiB C-heap for the jammed bytes
  • Mars: ~2 GiB C-heap for the cue dictionary (transient) + ~2 GiB loom for
    the decoded noun + ~2 GiB loom for the re-jammed LMDB event = ~6 GiB peak
  • The pipe itself becomes a congestion point at multi-GiB sizes

Design

Protocol

A new %plea message type is added to the urth↔mars IPC protocol (alongside the
existing %poke, %peek, %live, etc.):

urth → mars   [%plea len=@ud]          request: allocate shm of len bytes
mars → urth   [%plea nam=@t len=@ud]   response: shm name + confirmed length
urth → mars   [%done ~]                urth has filled shm; proceed

The normal %poke response from mars back to urth is unchanged; the plea writ
is converted to a poke writ in-place before %done is sent so that mars's
eventual [%poke ...] reply matches through the standard writ-queue path.
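The handshake ordering can be modeled as a small state machine. The type and
function names below are purely illustrative, not the actual vere types:

```c
#include <assert.h>
#include <stdint.h>

/* hypothetical model of the three-step handshake:
   urth requests len bytes, mars grants a name + confirmed length,
   urth fills the region and sends %done */
typedef enum { PLEA_IDLE, PLEA_WAIT, PLEA_FILL, PLEA_DONE } plea_st;

typedef struct {
  plea_st     st;
  uint64_t    len;
  const char* nam;
} plea;

/* urth → mars [%plea len=@ud] */
static int plea_request(plea* p, uint64_t len) {
  if (p->st != PLEA_IDLE) return -1;
  p->len = len; p->st = PLEA_WAIT; return 0;
}

/* mars → urth [%plea nam=@t len=@ud]: length must match the request */
static int plea_grant(plea* p, const char* nam, uint64_t len) {
  if (p->st != PLEA_WAIT || len != p->len) return -1;
  p->nam = nam; p->st = PLEA_FILL; return 0;
}

/* urth → mars [%done ~]: only valid after the grant */
static int plea_done(plea* p) {
  if (p->st != PLEA_FILL) return -1;
  p->st = PLEA_DONE; return 0;
}
```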

Threshold

The plea path is taken when the serialized noun size exceeds _UNIX_PLEA_THRESHOLD
(256 MiB), currently triggered from _unix_update_mount in pkg/vere/io/unix.c.

Shared memory ownership

  • Mars creates the shm object (shm_open + ftruncate), sends the name to
    urth, then waits for %done.
  • Urth opens the same shm by name, calls the fill callback (writes the noun),
    then munmaps and sends %done. Urth never owns the shm region.
  • Mars receives %done, mmaps the shm read-only, deserializes the noun, then
    munmaps and shm_unlinks before continuing.
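The ownership split can be sketched with standard POSIX calls. The shm name and
helper names here are hypothetical (real code would generate a unique name per
plea), but the call sequence mirrors the bullets above:

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define DEMO_SHM_NAME "/plea-demo"   /* hypothetical; real code generates one */

/* mars side: create the object and size it (shm_open + ftruncate) */
static int mars_create(size_t len) {
  int fd = shm_open(DEMO_SHM_NAME, O_CREAT | O_RDWR, 0600);
  if (fd < 0) return -1;
  if (ftruncate(fd, (off_t)len) < 0) { close(fd); return -1; }
  return fd;
}

/* urth side: open by name, fill, munmap; urth never unlinks the region */
static int urth_fill(size_t len, const char* msg) {
  int fd = shm_open(DEMO_SHM_NAME, O_RDWR, 0);
  if (fd < 0) return -1;
  void* buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  close(fd);
  if (buf == MAP_FAILED) return -1;
  memcpy(buf, msg, strlen(msg) + 1);
  return munmap(buf, len);
}

/* mars side: map read-only, consume, then munmap and shm_unlink */
static int mars_consume(int fd, size_t len, char* out, size_t cap) {
  void* buf = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
  close(fd);
  if (buf == MAP_FAILED) return -1;
  strncpy(out, buf, cap);
  munmap(buf, len);
  return shm_unlink(DEMO_SHM_NAME);
}
```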

Noun serialization: raw loom copy (no jam/cue)

Instead of jam/cue, the shm buffer holds a compact binary encoding of the raw
loom noun structure, implemented in pkg/noun/allocate.c:

u3a_noun_shm_size(u3_noun som) → c3_d
: DFS traversal counting bytes needed. Handles DAG sharing via a
ur_dict64_t (loom offset → sentinel). Returns total byte length including
the 16-byte header.

u3a_noun_to_shm(u3_noun som, c3_y* shm_y, c3_d cap_d) → c3_d
: Iterative post-order DFS. Writes each unique indirect object (atom or cell)
exactly once in child-before-parent order. Returns bytes written.

u3a_noun_from_shm(const c3_y* shm_y, c3_d len_d) → u3_weak
: Single-dict two-phase deserializer (see below). Returns the root noun
allocated on the current road, or u3_none on error.

SHM buffer format

offset  size  field
     0     8  root_noun   -- root noun in shm-offset space (see tags below)
     8     8  data_len    -- byte length of data section
    16  ...   data        -- allocations in DFS post-order

Noun values in shm-offset space use the top two bits as a tag:

  • 00xxxxxxx… — direct atom (fits in 62 bits); stored as-is
  • 10xxxxxxx… — indirect atom; low 62 bits = byte offset into data section
  • 11xxxxxxx… — cell; low 62 bits = byte offset into data section
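The tagging scheme can be sketched with a few helpers. Macro and function names
are hypothetical, not the ones in allocate.c; note that tag 01 is unassigned:

```c
#include <assert.h>
#include <stdint.h>

#define SHM_TAG_MASK   (3ULL << 62)
#define SHM_TAG_IATOM  (2ULL << 62)  /* 10: indirect atom, low 62 bits = offset */
#define SHM_TAG_CELL   (3ULL << 62)  /* 11: cell, low 62 bits = offset          */

/* tag a data-section byte offset as an indirect atom or a cell */
static inline uint64_t shm_iatom(uint64_t off) { return SHM_TAG_IATOM | off; }
static inline uint64_t shm_cell(uint64_t off)  { return SHM_TAG_CELL  | off; }

/* strip the tag to recover the offset */
static inline uint64_t shm_off(uint64_t val)   { return val & ~SHM_TAG_MASK; }

/* 00 prefix: a direct atom, stored as-is */
static inline int shm_is_direct(uint64_t val) {
  return (val & SHM_TAG_MASK) == 0;
}
static inline int shm_is_cell(uint64_t val) {
  return (val & SHM_TAG_MASK) == SHM_TAG_CELL;
}
```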

Each atom entry in the data section:

+0   8  len_w    -- number of 64-bit data words (> 0 distinguishes from cell)
+8   4  mug_h    -- cached mug
+12  4  (pad)
+16  len_w*8     -- atom data words, LSB-first

Each cell entry in the data section:

+0   8  tag=0    -- zero distinguishes from atom
+8   4  mug_h
+12  4  (pad)
+16  8  hed      -- head noun in shm-offset space
+24  8  tel      -- tail noun in shm-offset space
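The two entry layouts above map naturally onto C structs. These type names are
illustrative only, not the ones in allocate.c:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* atom entry: len_w > 0 distinguishes it from a cell entry */
typedef struct {
  uint64_t len_w;     /* number of 64-bit data words          */
  uint32_t mug_h;     /* cached mug                           */
  uint32_t pad;
  uint64_t dat[];     /* len_w atom data words, LSB-first     */
} shm_atom_entry;

/* cell entry: the leading tag word is always zero */
typedef struct {
  uint64_t tag;       /* 0 — distinguishes a cell from an atom */
  uint32_t mug_h;
  uint32_t pad;
  uint64_t hed;       /* head noun, in shm-offset space        */
  uint64_t tel;       /* tail noun, in shm-offset space        */
} shm_cell_entry;
```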

u3a_noun_from_shm: single-dict two-phase approach

A single ur_dict64_t serves both phases, halving peak C-heap vs a two-dict
approach:

  • Phase 1 (linear scan): for each cell entry, count how many times each
    shm offset appears as hed, tel, or root. Store dict[shm_off] = refcount.
  • Phase 2 (linear scan): for each entry, read use_d = dict[shm_off],
    allocate the loom noun with use_w = use_d, then overwrite
    dict[shm_off] = loom_noun.

This is safe because data is written in post-order: when phase 2 encounters a
cell, both children have already been processed and their dict entries already
hold the resolved loom nouns.
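A toy illustration of the two-phase idea, with a plain array standing in for
ur_dict64_t and integers standing in for loom nouns (a cell "resolves" to
hed + tel here, purely to show that children are ready before parents):

```c
#include <assert.h>
#include <stdint.h>

/* one data-section entry: a leaf value, or a cell over earlier indices */
typedef struct { int is_cell; uint64_t val; int hed, tel; } entry;

/* phase 1 (linear scan): count references to each entry */
static void phase1(const entry* e, int n, int root, uint64_t* dict) {
  for (int i = 0; i < n; i++) dict[i] = 0;
  for (int i = 0; i < n; i++)
    if (e[i].is_cell) { dict[e[i].hed]++; dict[e[i].tel]++; }
  dict[root]++;  /* the root is referenced once by the header */
}

/* phase 2 (linear scan): overwrite each refcount with the resolved value.
   Post-order guarantees a cell's children are already resolved. */
static uint64_t phase2(const entry* e, int n, int root, uint64_t* dict) {
  for (int i = 0; i < n; i++) {
    uint64_t use = dict[i];  /* refcount from phase 1; would seed use_w */
    (void)use;
    dict[i] = e[i].is_cell ? dict[e[i].hed] + dict[e[i].tel] : e[i].val;
  }
  return dict[root];
}
```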

Dict pre-sizing

Large nouns would otherwise trigger many costly resize generations. Both
u3a_noun_to_shm and u3a_noun_from_shm pre-size their ur_dict64_t via
_shm_dict_init, which picks the smallest adjacent Fibonacci pair whose initial
bucket count can hold the estimated node count (dat_d / _SHM_CELL_SIZE) without
resizing. The Fibonacci table in pkg/ur/defs.h was extended from ur_fib34
through ur_fib36 to cover the required range.
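The pre-sizing choice can be sketched as follows. The function name is
hypothetical, and the real _shm_dict_init works in terms of ur_dict64_t bucket
counts rather than returning the pair directly:

```c
#include <assert.h>
#include <stdint.h>

/* Walk the Fibonacci series until the larger element of an adjacent pair
   can hold the estimated node count; return the pair via out-params. */
static void shm_dict_presize(uint64_t nodes, uint64_t* prev, uint64_t* size) {
  uint64_t a = 1, b = 2;   /* ur_dict grows along the Fibonacci series */
  while (b < nodes) {
    uint64_t c = a + b;
    a = b;
    b = c;
  }
  *prev = a;
  *size = b;
}
```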


Files Changed

| File | Change |
| --- | --- |
| pkg/c3/motes.h | Added c3__plea mote |
| pkg/ur/defs.h | Added ur_fib29–ur_fib36 |
| pkg/vere/vere.h | Added u3_writ_plea enum value; pla_u struct in u3_writ union; u3_lord_plea() declaration |
| pkg/vere/mars.h | Added u3_mars_plea_e state; pla_u struct in u3_mars |
| pkg/vere/lord.c | _lord_plea_plea() handler; u3_lord_plea() public API; %plea dispatch in writ machinery |
| pkg/vere/mars.c | %plea and %done cases in _mars_work(); state guard in u3_mars_kick() |
| pkg/vere/io/unix.c | _unix_plea_ctx, _unix_plea_fill, plea branch in _unix_update_mount() |
| pkg/noun/allocate.h | Declarations for u3a_noun_shm_size, u3a_noun_to_shm, u3a_noun_from_shm |
| pkg/noun/allocate.c | Full implementations of the above; _shm_dict_init helper |

Test Results: 2 GiB File Commit

Tested on Apple M-series (ARM64, macOS 26.3), 16 GiB RAM, --urth-loom 34
(16 GiB virtual loom). The commit consisted of a single ~2 GiB binary file
written via the Clay Unix mount. vmmap snapshots taken immediately after
the commit completed (both processes idle, LMDB write in progress).

Process:  urbit [31135]  (urth)    Launch: 18:31:01  Sample: 18:31:53
Process:  urbit [31136]  (mars)    Launch: 18:31:01  Sample: 18:31:54

Urth (31135)

Physical footprint:         6.1 GiB  (peak: 8.0 GiB)

MALLOC_LARGE                78.5 MiB   1 region   (live — LMDB dat_y buffer)
MALLOC_LARGE (empty)         2.0 GiB  18 regions  (freed shm serialize dict)
VM_ALLOCATE                  4.1 GiB  35 regions  (urth loom dirty pages)
DefaultMallocZone           79.3 MiB  live         (incl. LMDB buffer)
  • The 2.0 GiB freed dict (18 regions) reflects the pre-sized old_to_shm
    dict from u3a_noun_to_shm. Before pre-sizing this was 33 regions and
    was still live (4.0 GiB) when sampled mid-serialize.
  • The 78.5 MiB live MALLOC_LARGE is the jammed event buffer (u3_feat::hun_y
    in disk.c) held pending async LMDB write — unavoidable given the jam-based
    event log.
  • The 4.1 GiB VM_ALLOCATE is the urth loom's dirty pages (~2 GiB noun +
    working set), all of which are MAP_ANON | MAP_PRIVATE.

Mars (31136)

Physical footprint:        12.6 GiB  (peak: 14.8 GiB)

MALLOC_LARGE (empty)        3.9 GiB  32 regions  (freed dicts — see below)
VM_ALLOCATE                 8.7 GiB  72 regions  (mars loom + shm region)
DefaultMallocZone           3.6 MiB  live         (no malloc leak)
mapped file                 1.0 TiB  virtual      (LMDB mmap, 13 MiB resident)
  • The 3.9 GiB freed MALLOC_LARGE is a mix of:
    • The shm decode dict (u3a_noun_from_shm, pre-sized, few resize generations)
    • The jam dict from u3qe_jam (starts at fib11/fib12, grows through many
      generations for a 2 GiB noun — this is the dominant contributor and is
      independent of the plea protocol)
  • Peak 14.8 GiB vs current 12.6 GiB: the ~2.2 GiB delta is the jam output
    buffer (dat_y) freed after the async LMDB write completed.
  • The 3.6 MiB live DefaultMallocZone confirms no malloc leak from the
    plea/decode path.

Peak breakdown

| Phase | Approximate cost |
| --- | --- |
| Mars loom (decoded 2 GiB noun + Arvo state) | ~6 GiB dirty loom pages |
| Shm region (owned by mars) | ~2 GiB VM_ALLOCATE |
| Shm decode dict (peak, pre-sized, freed) | ~1–2 GiB |
| Jam output for LMDB (transient, freed) | ~2 GiB |
| Urth loom dirty pages | ~4 GiB |
| Urth serialize dict (transient, freed) | ~2 GiB |
| Combined peak (urth + mars) | ~14–15 GiB |

Comparison: old pipe path vs plea protocol

| Metric | Pipe (jam/cue) | Plea (raw loom copy) |
| --- | --- | --- |
| Urth C-heap (jam buffer) | ~2 GiB, held until LMDB write completes | ~2 GiB (same — LMDB buffer) |
| Urth serialize overhead | none (jam is the buffer) | ~2 GiB transient dict |
| Mars cue dict (transient) | ~6 GiB (two ur_dict64_t) | ~1–2 GiB (one pre-sized dict) |
| Mars decoded noun on loom | ~6–8 GiB | ~6–8 GiB (same) |
| Pipe congestion | yes — 2 GiB over a Unix pipe | eliminated |
| Segfault on 2 GiB commit | yes | no |

Known Limitations / Future Work

  • Jam dict pre-sizing: Mars's residual ~2 GiB in MALLOC_LARGE (empty) is
    dominated by u3qe_jam's internal dict (used when writing the decoded noun to
    the event log). Pre-sizing that dict from the known noun size would reduce mars
    peak by ~1–2 GiB.
  • Threshold tuning: The 256 MiB threshold is conservative. A lower value
    (e.g. 64 MiB) would engage the plea path more aggressively, and the
    per-call overhead (shm creation, two IPC round-trips) is small enough to
    permit it.
  • Linux testing: All measurements above are macOS ARM64. The shm path uses
    standard POSIX interfaces (shm_open, mmap, munmap, shm_unlink) and
    should be portable, but has not yet been profiled on Linux.
